Consolidated Segmentation and Churn Analysis of Bank Clients

By: Tamjid Ahsan


As capstone project of Flatiron Data Science Bootcamp.

ABSTRACT


For mature businesses, retaining existing customers costs far less than acquiring new ones, so acquisition alone is no longer a sufficient growth strategy. For this reason, customer churn management becomes instrumental for any service industry.

This analysis combines churn prediction and customer segmentation to build an integrated customer-analytics outline for churn management. There are six components: data pre-processing, exploratory data analysis, customer segmentation, customer characteristics analytics, churn prediction, and factor analysis. The analysis follows the OSEMN framework for data science (Obtain, Scrub, Explore, Model, iNterpret).

Customer data of a bank is used for this analysis. After preprocessing and exploratory data analysis, customer segmentation is carried out using K-means clustering. A Random Forest model, tuned to optimize F1 score, is used to validate the clustering and extract feature importance. Segmenting customers into distinct groups enables marketers and decision makers to target existing retention strategies more precisely. Several machine learning models are then trained on the preprocessed data together with the segment labels predicted by the K-means model; these models are optimized for precision. To address class imbalance, the Synthetic Minority Oversampling Technique (SMOTE) is applied to the training set. Feature importances from the models are used for factor analysis.

OVERVIEW




Customer churn is a big issue that occurs when consumers abandon your products and go to another provider. Because of the direct impact on profit margins, firms are now focusing on identifying consumers who are at risk of churning and keeping them through tailored promotional offers. Customer churn analysis and customer turnover rates are frequently used as essential business indicators by banks, insurance firms, streaming service providers, and telecommunications service providers, since the cost of maintaining existing customers is significantly less than the cost of obtaining new ones.

When it comes to customers, the financial crisis of 2008 changed the banking sector's strategy. Prior to the crisis, banks were mostly focused on acquiring more and more clients. However, once the market imploded, banks realized rapidly that the expense of attracting new clients is multiple times higher than holding existing ones, which means losing clients can be monetarily unfavorable. Fast forward to today, and the global banking sector has a market capitalization of $7.6 trillion, with technology and regulation making it easier than ever to transfer assets and money between institutions. Furthermore, this has given rise to new forms of competition for banks, such as open banking, neo-banks, and fin-tech businesses (Banking as a Service, BaaS)[1]. Overall, today's consumers have more options than ever before, making it easier than ever to switch or quit banks altogether. According to studies, repeat customers are likely to spend 67 percent more on a bank's products and services, emphasizing the necessity of knowing why clients churn and how it varies across different characteristics. Banking is one of those conventional sectors that has undergone continuous development throughout the years. Nonetheless, many banks today with a sizable client base, expecting to gain a competitive advantage, have not tapped into the huge amounts of data they have, particularly in tackling one of the most well-known challenges: customer turnover.

Churn can be expressed as a level of customer inactivity or disengagement seen over a specific period. This expresses itself in the data in a variety of ways, e.g., frequent balance transfers to another account or an unusual drop in average balance over time. But how can anyone look for churn indicators? Collecting detailed feedback on the customer's experience can be difficult. For one thing, surveys are both rare and costly. Furthermore, not all clients receive them, or bother to reply. So, where else can you look for indicators of future client dissatisfaction? The solution lies in identifying early warning indicators from existing data. Advanced machine learning and data science techniques can learn from previous customer behavior and external events that lead to churn, and use this knowledge to anticipate the possibility of a churn-like event in the future.


References:

[1] Business Insider

[2] Stock images from PEXELS

BUSINESS PROBLEM



While everyone recognizes the importance of maintaining existing customers and thereby improving their lifetime value, there is very little banks can do about customer churn if they do not see it coming in the first place. Predicting attrition becomes critical in this situation, especially when unambiguous consumer feedback is lacking. Precise prediction enables marketers and client experience teams to be imaginative and proactive in their offerings to the client.

XYZ Bank (read: fictional) is a mature financial institution based in Eastern North America. Recent advances in technology and the rise of BaaS are a real threat, as such services can lure away the existing clientele. The bank has existing data on its clients and, based on the data available, wants to know which of them are at risk of churning.

This analysis focuses on the behavior of bank clients who are more likely to leave the bank (i.e. close their bank account, churn).

IMPORTS

OBTAIN

The data for this analysis is obtained from Kaggle, from the dataset titled "Credit Card customers" uploaded by Sakshi Goyal; it was originally sourced from LEAPS Analyttica. A copy of the data is in this repository at /data/BankChurners.csv.

This dataset contains data on more than 10,000 credit card accounts, with around 19 variables of different types captured at a point in time, along with an attrition indicator over the following 6 months.

Data description is as below:

| Variable | Type | Description |
| --- | --- | --- |
| Clientnum | Num | Client number. Unique identifier for the customer holding the account |
| Attrition_Flag | obj | Internal event (customer activity) variable - if the account is closed then 1, else 0 |
| Customer_Age | Num | Demographic variable - customer's age in years |
| Gender | obj | Demographic variable - M=Male, F=Female |
| Dependent_count | Num | Demographic variable - number of dependents |
| Education_Level | obj | Demographic variable - educational qualification of the account holder (example: high school, college graduate, etc.) |
| Marital_Status | obj | Demographic variable - Married, Single, Divorced, Unknown |
| Income_Category | obj | Demographic variable - annual income category of the account holder (< $40K, $40K - $60K, $60K - $80K, $80K - $120K, > $120K, Unknown) |
| Card_Category | obj | Product variable - type of card (Blue, Silver, Gold, Platinum) |
| Months_on_book | Num | Months on book (length of relationship) |
| Total_Relationship_Count | Num | Total no. of products held by the customer |
| Months_Inactive_12_mon | Num | No. of months inactive in the last 12 months |
| Contacts_Count_12_mon | Num | No. of contacts in the last 12 months |
| Credit_Limit | Num | Credit limit on the credit card |
| Total_Revolving_Bal | Num | Total revolving balance on the credit card |
| Avg_Open_To_Buy | Num | Open-to-buy credit line (average of last 12 months) |
| Total_Amt_Chng_Q4_Q1 | Num | Change in transaction amount (Q4 over Q1) |
| Total_Trans_Amt | Num | Total transaction amount (last 12 months) |
| Total_Trans_Ct | Num | Total transaction count (last 12 months) |
| Total_Ct_Chng_Q4_Q1 | Num | Change in transaction count (Q4 over Q1) |
| Avg_Utilization_Ratio | Num | Average card utilization ratio |

There are "Unknown" categories in three features: Education Level, Marital Status, and Income Category. Imputing values for these features does not make sense, and it is understandable why they are unknown in the first place: information about education and marital status is often sensitive and confidential, and customers are reluctant to share it; the same holds for income level. It is best for the model to be able to handle the case where this information is not available and still produce a prediction.

For this reason, these values are not imputed in any way for this analysis.

There is a major class imbalance in the target column.
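The imbalance can be confirmed with a quick class-ratio check; below is a minimal sketch on a toy stand-in for the target column (the real data lives in /data/BankChurners.csv):

```python
import pandas as pd

# Toy stand-in for the target column; in the real dataset the
# attrited customers are the minority class.
attrition = pd.Series(["Existing"] * 84 + ["Attrited"] * 16,
                      name="Attrition_Flag")

# With roughly one in six accounts attrited, accuracy alone would be
# a misleading metric, which is why resampling is applied in SCRUB.
class_ratio = attrition.value_counts(normalize=True)
print(class_ratio)
```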

No null values to deal with. Features have the correct data type.


EDA


SCRUB

Class imbalance will be addressed by synthetic oversampling later in this section.

Label encoding

Train-Test split

Encoding & Scaling

Pipeline

SMOTENC

MODEL

Segmentation

Finding "K"

A higher Silhouette Coefficient and a higher Calinski-Harabasz score both indicate a model with better-defined clusters.

Although no obvious optimal K can be spotted from the visuals alone, K=5 seemed optimal for the initial model based on the Silhouette Score and the sum of squared errors (the elbow plot). The Calinski-Harabasz score also supports this choice.
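The three K-selection metrics can be computed in a short loop. A minimal sketch on synthetic blob data (the real analysis runs this on the scaled customer features):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score, silhouette_score

# Synthetic stand-in for the scaled features, with a known 5-group
# structure so the metrics have a clear optimum.
centers = [(-8, -8), (-8, 8), (8, -8), (8, 8), (0, 0)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1.0,
                  random_state=7)

scores = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=7).fit(X)
    scores[k] = {
        "sse": km.inertia_,                                # elbow plot input
        "silhouette": silhouette_score(X, km.labels_),     # higher is better
        "calinski_harabasz": calinski_harabasz_score(X, km.labels_),
    }

best_k = max(scores, key=lambda k: scores[k]["silhouette"])
print("K with the best silhouette score:", best_k)
```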

Customers are segmented into 5 groups by their characteristics.

Among models run for K in the range 2 to 10, K=5 is recommended by the yellowbrick package.

The result of MeanShift supports the initial choice of K=5.
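MeanShift does not take K as an input, which is why it serves as an independent cross-check on the chosen K. A sketch on the same kind of synthetic data as above:

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs

# Synthetic 5-group data standing in for the scaled customer features.
centers = [(-8, -8), (-8, 8), (8, -8), (8, 8), (0, 0)]
X, _ = make_blobs(n_samples=500, centers=centers, cluster_std=1.0,
                  random_state=7)

# MeanShift infers the number of clusters from the data itself; the
# bandwidth controls how far the kernel reaches.
bandwidth = estimate_bandwidth(X, quantile=0.2, random_state=7)
ms = MeanShift(bandwidth=bandwidth).fit(X)
n_clusters = len(np.unique(ms.labels_))
print("clusters found by MeanShift:", n_clusters)
```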

Selecting "K"

More insights on the segmentation are in the INTERPRET part of this analysis.

Principal component analysis is used to reduce the features so that the clusters can be visualized in three-dimensional space.

Even though PCA explains only about forty percent of the variance in the dataset, the clusters exhibit clear separation in three-dimensional space. I am content with the selected K of 5. This will be further evaluated during the inter-cluster exploration in a later part.
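A minimal sketch of the PCA reduction step, using synthetic stand-in data (the real analysis applies this to the scaled customer features and colors the points by their K-means labels):

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic high-dimensional stand-in for the customer features.
X, labels = make_blobs(n_samples=500, centers=5, n_features=12,
                       random_state=3)

# Reduce to three principal components purely for visualization;
# X_3d can be passed to a 3-D scatter plot, e.g. matplotlib's Axes3D.
pca = PCA(n_components=3, random_state=3)
X_3d = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()
print(f"variance explained by three components: {explained:.0%}")
```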

Feature importance

The newly created cluster_df is used to determine which features most often drive the segmentation. A Random Forest model provides feature importances, alongside a permutation importance analysis, to identify the most important features.
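The two importance measures can be computed as below. This is a sketch on synthetic data, with generated class labels standing in for the K-means segment labels in cluster_df:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in for cluster_df: features X, segment labels y.
X, y = make_classification(n_samples=600, n_features=8, n_informative=4,
                           n_classes=5, n_clusters_per_class=1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

# Impurity-based importances are fast but can be biased toward
# high-cardinality features; permutation importance on held-out data
# is a useful cross-check.
perm = permutation_importance(rf, X_test, y_test, n_repeats=10,
                              random_state=0)
ranking = np.argsort(perm.importances_mean)[::-1]
print("feature indices ranked by permutation importance:", ranking)
```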

Segmentation Characteristics

intra cluster EDA


inter cluster EDA

Features are explored across clusters based on the feature importance insights from the previous part of the analysis; the most important features are examined.

Cluster Distribution
Customer Age

Clusters 4 and 1 have similar distributions. Cluster 0 is younger. Cluster 3 is distinct, as it is mostly comprised of older clients.

Credit Limit
Avg Utilization Ratio
Months on book

All of them show a similar spread except Cluster 3; those are the most loyal clients.

Total_Trans_Amt

Cluster 0 has the highest transaction amounts. The rest show a similar pattern.

Avg_Open_To_Buy
Total_Trans_Ct
Total_Revolving_Bal
Total_Relationship_Count

Cluster 0 is mostly comprised of clients with a low relationship count. The rest of the clusters have similar distributions.

Dependent_count

All of them are mostly similar.

with churn info

All the features are explored with respect to churn.


name clusters

Prediction

The cluster label predicted by the clustering model is used as a feature for the churn prediction models. Models without this feature were also tried; they performed slightly worse. For the final modeling approach, the dataset containing the predictions from the K-means model is used.

Baseline model

The baseline model performs on par with a random coin flip.
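The coin-flip baseline can be reproduced with scikit-learn's DummyClassifier; a minimal sketch on toy labels with roughly the same class ratio as the bank data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Toy imbalanced churn labels (1 = churned).
y = np.array([0] * 840 + [1] * 160)
X = np.zeros((len(y), 1))  # features are irrelevant to a dummy model

# "uniform" flips a fair coin for each prediction, i.e. the coin-flip
# baseline described above; "most_frequent" would instead score ~84%
# accuracy while never catching a single churner.
dummy = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
acc = accuracy_score(y, dummy.predict(X))
print(f"coin-flip baseline accuracy: {acc:.2f}")
```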

Logistic Regression

'Avg_Open_To_Buy' with 'Credit_Limit', 'Card_Category_Silver' with 'Card_Category_Blue', 'Gender_M' with 'Gender_F', 'Months_on_book' with 'Customer_Age', and 'Total_Trans_Ct' with 'Total_Trans_Amt' show high multicollinearity. This is expected given the nature of these features.

Multicollinearity undermines the statistical significance of an independent variable. It is important to point out, however, that multicollinearity does not affect the model's predictive accuracy, so this issue is not addressed for now.
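One simple way to surface such pairs is a pairwise-correlation scan. A sketch on toy data where Avg_Open_To_Buy is constructed from Credit_Limit and Total_Revolving_Bal, mirroring the structural relationship between those columns (the numbers here are synthetic):

```python
import numpy as np
import pandas as pd

# Toy frame with one engineered collinear pair: by construction,
# Avg_Open_To_Buy = Credit_Limit - Total_Revolving_Bal.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "Credit_Limit": rng.normal(10_000, 3_000, 500),
    "Total_Revolving_Bal": rng.normal(1_200, 400, 500),
})
df["Avg_Open_To_Buy"] = df["Credit_Limit"] - df["Total_Revolving_Bal"]

# Flag feature pairs whose absolute correlation exceeds a threshold.
corr = df.corr().abs()
pairs = [
    (a, b)
    for i, a in enumerate(corr.columns)
    for b in corr.columns[i + 1:]
    if corr.loc[a, b] > 0.75
]
print("highly correlated pairs:", pairs)
```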

The model is not good enough at predicting target class 1, churned customers, although accuracy is good.

The accuracy is good enough, but the F1 and precision values show that the errors fall heavily on the churn class, which supports the previous point about model performance. Outlier removal was considered next, but not pursued: data loss would be very high under both IQR and Z-score based approaches, since the numeric features contain many recurring values (lots of zeros).

Critical features for churning:

Odds ratios are used to measure the relative odds of the occurrence of the outcome, given a factor of interest [Bland JM, Altman DG. (2000), The odds ratio]. The odds ratio is used to determine whether a particular attribute is a risk factor or a protective factor for a particular class, and the magnitude of the percentage effect is used to compare the various risk factors for that class. A positive percentage effect means that the factor is positively correlated with churn, and vice versa.

The odds ratio and percentage effect of each feature are estimated as $\mathbf{OddsRatio} = e^{\Theta}$ and $\mathbf{Effect\,(\%)} = 100 \times (OddsRatio - 1)$, where $\Theta$ is the weight of the feature in the Logistic Regression model. If the effect is positive, the greater the factor, the more likely the client is to churn; such factors are considered risk factors. If the effect is negative, the greater the factor, the less likely the customer is to churn; these can be considered protective factors. This provides a simple, interpretable measure of feature importance.
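The two formulas translate directly into code. A sketch on synthetic data with hypothetical feature names; the real analysis would apply the same transformation to the fitted model's coefficients and the bank feature names:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn data; "feat_i" names are hypothetical.
X, y = make_classification(n_samples=800, n_features=5, random_state=0)
feature_names = [f"feat_{i}" for i in range(5)]

logit = LogisticRegression(max_iter=1000).fit(X, y)

# OddsRatio = exp(theta); Effect(%) = 100 * (OddsRatio - 1)
odds_ratio = np.exp(logit.coef_[0])
effect_pct = 100 * (odds_ratio - 1)

for name, orat, eff in zip(feature_names, odds_ratio, effect_pct):
    kind = "risk factor" if eff > 0 else "protective factor"
    print(f"{name}: OR={orat:.2f}, effect={eff:+.1f}% ({kind})")
```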

Random Forest

Original data

Oversampled (SMOTE) data

XGBoost

XGBClassifier

XGBRFClassifier

Best model

INTERPRET

Customer Segmentation model

Churn Prediction model

RECOMMENDATION

CONCLUSION

NEXT STEPS

Modeling aspect: Gaussian Mixture Models for segmentation, and a Neural Network based approach for the prediction model.

Business need aspect: part of the business challenge is determining how far in advance the model should forecast. A prediction made too long in advance may be less accurate. A narrow prediction horizon, on the other hand, may perform better in terms of accuracy, but it may be too late to act once the customer has made their decision.

Finally, it is critical to establish whether churn should be characterized at the product level (customers likely to discontinue using a certain product, such as a credit card) or at the relationship level (clients likely to disengage from the bank itself). Evaluating data at the relationship level gives a broader view of the customer's perspective. Excessive withdrawals from a savings account, for example, may be used to pay for a deposit on a house or for education costs. Such insights into client life events are extremely effective not just for preventing churn, but also for cross-selling complementary products that may deepen the engagement even further.

APPENDIX

All functions and imports from functions.py and packages.py.

Dashboard

Logistic regression with no category dropped for categorical columns

SVC